About MEBS

The main goal of MEBS is capture with a single value the importance of complex metabolic pathways or biogeochemical cycles in a large omic datasets (either genomes or metagenomes). The algortithm has been thoroughly tested with the sulfur cycle, but currently other cycles are also supported. Use the script mebs.pl to avoid reading the manual and score your own genome/metagnome in terms of biogeochemical cycles. All that is required is a directory containing peptide FASTA files of encoded proteins/fragments with .faa extension.

MEBS installation

The MEBS software is available as an open-source package distributed from a GitHub repository. Thus, the natural way of installing it is by cloning the repository via the following commands:

git clone https://github.com/eead-csic-compbio/metagenome_Pfam_score

#Alternatively, a ZIP file can be downloaded and then unpacked:

unzip metagenome_Pfam_score-master.zip

Requisites

Before you start, make sure you have hmmserch installed, v3.1b1 or greater, otherwise the program will generate errors, see issue 1

Running MEBS

perl mebs.pl

  usage: mebs.pl [options] 

   -help    Brief help message
   
   -input   Folder containing FASTA peptide files (.faa)                  (required)

   -type    Nature of input sequences, either 'genomic' or 'metagenomic'  (required)

   -fdr     Score cycles with False Discovery Rate 0.1 0.01 0.001 0.0001  (optional, default=0.01)

   -cycles  Show currently supported biogeochemical cycles
  
   -comp   Compute the metabolic completeness                            (optional)
  

MEBS supported cycles

The following biogeochemical cycles are ready to use with MEBS:

  1. sulfur: Includes the mobilization of inorganic and inorganic sulfur compounds
  2. carbon:Usage of CH4 compounds by methanotrophs, methanogens, and methylotrophs
  3. oxygen: Represented by oxygenic photosynthesis
  4. iron: The Fe reduction and oxidation including also siderophores uptake
  5. nitrogen: We included the pathways involved in the reduction and oxidation of both inorganic (nitrate(+5) to ammonia(-3) ) and organic nitrogen compounds (i.e taurine, urea, and choline degradation)
perl mebs.pl -cycles

# Available cycles:
sulfur
carbon
oxygen
iron
nitrogen

Modalities

The main modalities of MEBS are

  1. Score: Capturing the metabolic machinery of your genome or metagenome in terms of single scores: Basic Mode
  2. Markers: Detect possible marker genes according to their informational content (entropy)
  3. Completeness: Evaluate the metabolic completeness of metabolic pathways
  4. Kegg visualization: Visualizate the protein domains in your genome or metagenome using KEGG visualization Modalities

Basic Mode

Starting point

MEBS assumes that your sequencing data is in fasta format (.faa) extension in an especific directory. Example of the input data can be foun in test_genomes/ or test_metagenomes/ directories

head  -3 test_genomes/*.faa 
==> test_genomes/Archaeoglobus_profundus_DSM_5631.faa <==
>WP_012766394.1 hypothetical protein [Archaeoglobus profundus]
MGSQEVGRIEEEVVEERRQEEEEIDEEEATGSALLTAEEFDKKIEEIKAKVREAVQEALANTLADLLEKEEKEEKRKDET
KAEELPKCPKNLSWLKKMFEILPLDILRNSKLWRYRHCVWALQEAEKEAEKFRNS

==> test_genomes/Enterococcus_durans.faa <==
>WP_000053907.1 MULTISPECIES: replication control protein PrgN [Bacilli]
MSLKNYVYSHPVNVFIIGRLGLSVEVFCELYGFKQGTLSSWVTREKTVASLPIEFLHGLSLASGETMDQVYDCLCVLEQE
YIEYKIANELRKRKKYIQ

Running MEBS

To run MEBS you only need to specifyt the input folder and the type of data (either genomic or metagenomic). If you are working with MAGs (metagenome-assembled genomes), use the option ‘genomic’. You need to speciy the type of file for MEBS to allocate the pre-computed entropies for each type of data considering the fragmentary nature of the metagenomic sequences.

Genomic data

perl mebs.pl  -input test_genomes/ -type genomic 
       sulfur   carbon  oxygen  iron    nitrogen
Enterococcus_durans.faa -0.063  0.284   0.883   0.214   3.044
Archaeoglobus_profundus_DSM_5631.faa    11.434* 24.834* 1.493   0.765   6.873

The scores that meets the criteria of specific FDR are shown in asterisc, yet the score will be the same regardless of the FDR that is used. If the Score if greater or equal to the FDR, then an asterisc will be shown in the output. In the case of using the default FDR (0.01), more false positive will be obtained, for example the genome Archaeoglobus profundus a well known microorgnism involved in the S-cycle, could seem to have a CH4 metabolism by using a default FDR,however if we increase to FDR 0.001, the C cycle asterisc is gone and only the S-cycle ramain. Therefore, we recomend a more restrictive FDR in order to eliminate false positives.

perl mebs.pl  -input test_genomes/ -type genomic -fdr 0.001

           sulfur       carbon  oxygen  iron    nitrogen
Enterococcus_durans.faa -0.063  0.284   0.883   0.214   3.044
Archaeoglobus_profundus_DSM_5631.faa    11.434  24.834* 1.493   0.765   6.873

If you attempt to benchmark your own metabolism, we recomend to add your own FDR values in this config file at the end of this file. In a 16.04 Ubuntu system, 16Gb RAM, intel Inside i7 the time to run the scritpt in the example folder is less than 20 seconds.

real    0m14.183s
user    0m22.961s
sys 0m0.865s

Comparing the resutls

If you want to compare your results with a dataset of 2107 non-redundant genomes plese click at the following TABLE.

Also, If you want to perform your own analysis and comparisions (i.e clustering your data to see which is the genome with more metabolic relateness in terms of metabolic capabilities) you can have acces to the Table 6 Described in the MEBS database De Anda V , Zapata-Penasco I, Poot-Hernandez AC et al. Multigenomic Entropy Based Score (MEBS): the molecular reconstruction of the sulfur cycle. GigaScience Database 2017. http://dx.doi.org/10.5524/100357.

Metagenomic data

In the case of using metagenomic data metagenomic data the mean size length (MSL) and the size of the allocated entropies (MSLbin) is especify as warning. If you redirect the output of the script to a file this information will not be printed.

perl mebs.pl -input test_metagenomes -type metagenomic
# mebs.pl -input test_metagenomes -type metagenomic -fdr 0.01 -comp 0

# Computing Mean Size Length (MSL) ...
# 4511045.3_metagenome.faa MSL=32 MSLbin=30
# 4440966.3_metagenome.faa MSL=175 MSLbin=150

    sulfur  carbon  oxygen  iron    nitrogen
4511045.3_metagenome.faa    -2.295  1.790   5.412   2.745   13.024
4440966.3_metagenome.faa    5.817   8.804   1.178   4.579   11.697

As MEBS needs to compute the MSL of each metagenome the time will depend on the size of the input sample. In the test_metagenomes/ directory there are only two metagenomes with size of 2,8M and 6,5M respectively. The computation time will depend on the size and the number of metagenomes in the input forlder. In the example it takes less than 40 seconds to finish.

real    0m36.736s
user    1m40.328s
sys 0m2.599s

Real example: downloading and scoring MG-RAST data

Starting point

Create a file with the id’s of interest. As example, we provide a list of microbialites (microbial mats and stromatolites) and other environments publicly available from MG-RAST. Please have a look at the table containing metadata of the metagenomic dataset used in De Anda et al., 2017

MG-RAST id Type and origin Reference
4449591.3 4449590.3 Stromatolites from Highborne Cay, Bahamas (Khodadad & Foster, 2012)
4445126.3 4445129.3 Polar microbial mats (Antartic) (Varin et al., 2012)
4442467.3 4442466.3 4441363.3 4441347.3 Freshwater microbial mats from Cuatro Cienegas Basin (CCB (Bonilla-Rosso et al., 2012; Peimbert et al., 2012)
4443746.3 4443747.3 4443762.3 4443749.3 4443750.3 HMicrobial mats from Yellowstone National Park (Bhaya et al., 2007)
4454153.3 Purple sulfur bacteria biofilm (Wilbanks et al., 2014)
4440060.4 4440067.3 Stromatolite and thrombolite from CCB (Breitbart et al., 2009; Desnues et al., 2008)
4487624.3 4487625.3 Hydrothermal vents Taiwan (Tang et al., 2013)
4441138.3 4441137.3 Acid Mine Dreinage (AMD) mats (Jiao et al.,2011)
4449206.3 Hydrothermal vent Andes Colombia (Jiménez et al., 2012)
4491734.3 Polar cryconite (Edwards et al., 2013)
4744007.3 4744008.3 4744009.3 4744010.3 4744011.3 4744012.3 4744013.3 4744014.3 4744015.3 4744016.3 4744017.3 4744018.3 Microbial mats from Churince CCB (De Anda et al., in review)

We have also included freshwater microbrialites from Pavilion Lake, Clinton Creek described in Allen-White III et al., 2016, Allen-White III., et al., 2016, Respectively, hypersaline from marine environements described in Ruvindy et al., 2016

head 
4449591.3
4449590.3
4445126.3
4445129.3
4442467.3
4442466.3
4441363.3
4441347.3

Download the files with the faa extensions

This can take an hour depending on your download bandwith

cd test_mats 
time for line in `cat IDmats.txt`; do wget "http://api.metagenomics.anl.gov/1/download/mgm$line?file=350.1" -O $line.faa ; done

Compute MEBS

Use restrictive FDR and save the output in a tabular file

perl mebs.pl -input test_mats -input test_mats -type metagenomic -fdr 0.0001 -comp > scores.tab
# mebs.pl -input test_mats -type metagenomic -fdr 0.0001 -comp 1

# Computing Mean Size Length (MSL) ...
# 4441588.3.faa MSL=146 MSLbin=150
# 4442467.3.faa MSL=99 MSLbin=100
# 4443749.3.faa MSL=210 MSLbin=200
# 4441137.3.faa MSL=209 MSLbin=200
# 4441571.3.faa MSL=213 MSLbin=200
# 4442466.3.faa MSL=75 MSLbin=60
# 4493725.3.faa MSL=223 MSLbin=200
........

Lets have a look at the output

less -S  scores.tab
        sulfur  carbon  oxygen  iron    nitrogen        <sulfur comp>   sulfur_1        sulfur_2        sulfur_3        sulfur_4        sulfur_5        sulfur_6        sulfur_7        sulfur_8        sulf
4441588.3.faa   -2.367  -1.331  0.944   1.173   1.801   29.9    0.0     10.0    0.0     0.0     0.0     0.0     11.1    0.0     40.0    5.0     10.3    0.0     66.7    50.0    14.3    12.5    66.7    57.1
4442467.3.faa   3.098   8.511   7.088   8.758   17.550  78.7    

Analyzing MEBS output

Completeness

The output from MEBS is a tabular file text, where the columns indicate the scores (sulfur, carbon, oxygen iron, nitrogen), and the completeness of the pathways related to sulfur and carbon cycle).

To see which pathway belongs to eaach column and compute a heatmap with the completeness of the patwhays please have a look at the following example

scores2plot

scores2plot

Use the script completeness.py to create a heatmap with the percentage of completeness

python3 completeness.py scores.tab 

Grouping data according to their scores

Futhermore, you can use ClustVis, to visualizate your data. Just upload the scores2plot.tab file, transpose the data and generate the PCA

PCA

PCA

Tree

If you compute a MEBS in a dataset and you already have a phylogenetic tree of those genomes you can map MEBS across the phylogeneic tree, such as in this figure. Note that the values from the heatmap are normalize from 0-1. To perform the normalization please have a look at the following notebook

Phylogenetic tree with mebs

Phylogenetic tree with mebs

Maximum scores

To compare your data with the maximum scores that you can obtain from the entropy data, have a look at the following data If you sare computing MEBS in genomes compare your results with the row “Genomic data”. In the case that you are computing MEBS in metagenomes see the corresponding MSL and MSLbin you compare your results.

sulfur methane oxygen iron nitrogen
Genomic data 16.018 85.332 10.703 10.464 22.079
30 13.676 84.503 10.438 8.843 20.642
60 16.818 85.347 11.253 9.567 22.148
100 15.566 85.221 9.965 10.676 21.43
150 15.848 84.81 10.152 10.316 21.379
200 15.887 84.765 10.463 9.832 21.938
250 16.031 85.057 10.387 10.215 21.853
300 15.929 84.942 10.569 10.284 21.968

Summary of MEBS basic mode

Below we summarize the internal steps performed with MEBS in the basic mode using the main script mebs.pl

MEBS flowchart basic mode

MEBS flowchart basic mode

Advance mode

External dependencies for advance mode

The following external packages are required if you want to benchmark your own metabolic pathway. Interproscan and hmmsearch are needed in order to annotate Pfam domains within peptide sequences. The rest of packages are needed to run the full pipeline, which comprises four steps.

  1. Interproscan
  2. Python3
  3. Matplotlib 1.4 or greater
  4. Numpy
  5. Pandas
  6. Scikit-learn
  7. Jupyter-notebook
  8. MPL_toolkits

Scoring your data: Train your own classifier. Advanced Mode

For more advanced uses a manual is provided. The required input data are:

  1. FASTA file with peptides sequences of proteins involved in the cycle/pathway of interest.
  2. List of RefSeq accesions of (curated) genomes known to be involved in the cycle/pathway of interest.

These inputs are processed in order to train a classifier which internally uses Pfam domains.

As seen above, genomes or metagenomes provided by the user can then be scored with the trained classifier. Once a classifier has been trained, such as the Sulfur cycle, steps 1 and 3 can be skipped.

MEBS flowchart advance mode

MEBS flowchart advance mode

Support and Development

Planned feature improvements are publicly catalogued at the main MEBS development site on github. Bug reports and problems using MEBS are welcome on the issues tracker. We prefer posting to the issue tracker over email as these posts are searchable by other users who may experience the same problems.

Talk to us

We’d love to hear from you. Any comments and suggestions can be sent to Valerie De Anda valdeanda@ciencias.unam.mx

Cite

If you find this software usefull please cite us as:

  • De Anda V, Zapata-Penasco I, Poot Hernandez AC, Fruns LE, Contreras Moreira B, Souza V (2017) MEBS, a software platform to evaluate large (meta)genomic collections according to their metabolic machinery: unraveling the sulfur cycle. doi:10.1093/gigascience/gigascience/gix096/4561660

2018-09-16